S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training
Hu, Yuezhou, Zhu, Jun, Chen, Jianfei
Training deep neural networks (DNNs) is costly. Fortunately, Nvidia Ampere and Hopper GPUs can accelerate matrix multiplications to run twice as fast as their dense equivalents by implementing 2:4 sparsity. However, previous STE-based 2:4 pre-training methods (e.g. STE with hard thresholding, SR-STE) suffer from optimization difficulties because of their discontinuous pruning functions. In this study, we comprehensively analyse the bottleneck of traditional N:M sparse training and identify three drawbacks of discontinuity: an incorrect descent direction, an inability to predict the amount of descent, and sparse mask oscillation. In light of this analysis, we propose S-STE, a simple yet powerful 2:4 training method with two parts: continuously projecting weights to be 2:4 sparse, and rescaling sparse weights with a per-tensor fixed scaling factor. In addition, we adopt minimum-variance unbiased estimation for the activation gradient and FP8 quantization for the whole process. Results show that our method surpasses previous 2:4 pre-training recipes and is comparable even with full-parameter models.
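As background for the 2:4 pattern this abstract builds on, the following is a minimal pure-Python sketch of hard-threshold (magnitude) 2:4 pruning, the discontinuous baseline that S-STE's continuous projection is meant to replace. The function name and flat-list weight representation are illustrative, not from the paper:

```python
def prune_2_4(weights):
    """Hard-threshold 2:4 semi-structured pruning: in every group of
    four consecutive weights, keep the two with the largest magnitude
    and zero out the other two."""
    assert len(weights) % 4 == 0, "2:4 sparsity needs groups of 4"
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return pruned
```

The discontinuity the paper analyses is visible here: an infinitesimal change to a weight near the magnitude cutoff can flip which two entries survive, abruptly changing the mask.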
Learning-Theoretic Foundations of Algorithm Configuration for Combinatorial Partitioning Problems
Balcan, Maria-Florina, Nagarajan, Vaishnavh, Vitercik, Ellen, White, Colin
Max-cut, clustering, and many other partitioning problems that are of significant importance to machine learning and other scientific fields are NP-hard, a reality that has motivated researchers to develop a wealth of approximation algorithms and heuristics. Although the best algorithm to use typically depends on the specific application domain, a worst-case analysis is often used to compare algorithms. This may be misleading if worst-case instances occur infrequently, and thus there is a demand for optimization methods which return the algorithm configuration best suited for the given application's typical inputs. We address this problem for clustering, max-cut, and other partitioning problems, such as integer quadratic programming, by designing computationally efficient and sample-efficient learning algorithms which receive samples from an application-specific distribution over problem instances and learn a partitioning algorithm with high expected performance. Our algorithms learn over common integer quadratic programming and clustering algorithm families: SDP rounding algorithms and agglomerative clustering algorithms with dynamic programming. For our sample complexity analysis, we provide tight bounds on the pseudodimension of these algorithm classes, and show that, surprisingly, even for classes of algorithms parameterized by a single parameter, the pseudodimension is superconstant. In this way, our work both contributes to the foundations of algorithm configuration and pushes the boundaries of learning theory, since the algorithm classes we analyze consist of multi-stage optimization procedures and are significantly more complex than classes typically studied in learning theory.
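The basic learning setup the abstract describes can be sketched as empirical risk minimization over a parameterized algorithm family: draw instances from the application's distribution and pick the parameter with the best average performance. This is an illustrative toy, not the paper's procedure (the paper's contribution is the pseudodimension analysis that justifies such sampling); the names `configure`, `algorithm`, and `param_grid` are assumptions:

```python
def configure(algorithm, param_grid, instances):
    """Pick the parameter whose average performance over the sampled
    instances is highest (empirical risk minimization over a grid).

    algorithm(instance, p) -> performance score (higher is better).
    """
    best_param, best_score = None, float("-inf")
    for p in param_grid:
        score = sum(algorithm(inst, p) for inst in instances) / len(instances)
        if score > best_score:
            best_param, best_score = p, score
    return best_param
```

The paper's superconstant pseudodimension bounds show why this is subtle: even one-parameter families can have performance that varies intricately with the parameter, so a fixed grid may miss the optimum without the sample complexity guarantees they derive.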
Joint Neural Entity Disambiguation with Output Space Search
Shahbazi, Hamed, Fern, Xiaoli Z., Ghaeini, Reza, Ma, Chao, Obeidat, Rasha, Tadepalli, Prasad
In this paper, we present a novel model for entity disambiguation that combines both local contextual information and global evidence through Limited Discrepancy Search (LDS). Given an input document, we start from a complete solution constructed by a local model and conduct a search in the space of possible corrections to improve the local solution from a global viewpoint. Our search utilizes a heuristic function to focus more on the least confident local decisions and a pruning function to score the global solutions based on their local fitness and the global coherence among the predicted entities. Experimental results on the CoNLL 2003 and TAC 2010 benchmarks verify the effectiveness of our model.
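The search strategy named in the abstract can be illustrated with a generic Limited Discrepancy Search over corrections to a locally best solution: explore only assignments that deviate from the local model's top choice in at most k positions. This is a sketch of LDS in general, not the paper's model; the name `lds_correct` and the list-of-ranked-candidates representation are assumptions:

```python
from itertools import combinations, product

def lds_correct(local_choices, score, max_discrepancies=2):
    """Limited Discrepancy Search over corrections to a local solution.

    local_choices: one ranked candidate list per position, with the
    local model's most confident choice at index 0.
    score: maps a complete assignment (tuple) to a global score.
    Returns the best assignment that deviates from the locally best
    choice in at most max_discrepancies positions.
    """
    n = len(local_choices)
    base = [cands[0] for cands in local_choices]
    best, best_score = tuple(base), score(tuple(base))
    for k in range(1, max_discrepancies + 1):
        for positions in combinations(range(n), k):
            # Alternatives (ranks >= 1) at each deviating position.
            alternatives = [local_choices[i][1:] for i in positions]
            for combo in product(*alternatives):
                cand = list(base)
                for pos, val in zip(positions, combo):
                    cand[pos] = val
                s = score(tuple(cand))
                if s > best_score:
                    best, best_score = tuple(cand), s
    return best
```

Capping the discrepancy count keeps the search tractable while still letting a global coherence score overturn the least confident local decisions.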
Integrating Partial Order Reduction and Symmetry Elimination for Cost-Optimal Classical Planning
Wehrle, Martin (University of Basel) | Helmert, Malte (University of Basel) | Shleyfman, Alexander (Technion, Haifa) | Katz, Michael (IBM Haifa Research Lab)
Pruning techniques based on partial order reduction and symmetry elimination have recently received increasing attention for optimal planning. Although these techniques appear rather different, they base their pruning decisions on similar ideas from a high-level perspective. In this paper, we propose safe integrations of partial order reduction and symmetry elimination for cost-optimal classical planning. We show that previously proposed symmetry-based search algorithms can safely be applied with strong stubborn sets. In addition, we derive the notion of symmetrical strong stubborn sets as a more tightly integrated concept. Our experiments show the potential of our approaches.
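To make the symmetry-elimination side concrete, here is a generic sketch of symmetry pruning in blind search: map every generated state to a canonical representative of its symmetry orbit and expand each orbit only once. This illustrates the general idea only, not the paper's planner or its stubborn-set integration; the function names and tuple-state encoding are assumptions:

```python
from collections import deque

def canonical(state, symmetries):
    """Canonical representative of state's orbit under the given
    symmetry maps (each a function state -> state)."""
    return min(sym(state) for sym in symmetries)

def bfs_with_symmetry_pruning(start, goal_test, successors, symmetries):
    """Breadth-first search that prunes any state whose canonical form
    was already seen, so each symmetry orbit is explored once.
    Returns the goal depth, or None if no goal is reachable."""
    seen = {canonical(start, symmetries)}
    queue = deque([(start, 0)])
    while queue:
        state, depth = queue.popleft()
        if goal_test(state):
            return depth
        for nxt in successors(state):
            rep = canonical(nxt, symmetries)
            if rep not in seen:
                seen.add(rep)
                queue.append((nxt, depth + 1))
    return None
```

For this pruning to preserve optimality, the goal test and costs must be invariant under the symmetries; the paper's contribution is showing when such pruning remains safe in combination with strong stubborn sets.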